This lab will run from 10-12 in the computer cluster on Nov 8th - Bo Yao will be on hand to assist. I will be available on our Slack channel in case you need extra help.

In this lab you’ll have the chance to go over a number of things that we covered before reading week. We will focus on data subsetting, data visualisation, and building some linear models.

We will be using two sets of packages - the tidyverse packages for dplyr, ggplot2 etc. and the gapminder package which includes some nice global data related to population size, life expectancy, GDP etc for a bunch of countries over a number of years.

First of all, load the tidyverse and gapmiinder packages to your library. Remember, if you haven’t installed either before, you’ll need to first type install.packages(“packagename”) in the console. Obviously, replace “packagename” with the name of the package you want to download!

library(tidyverse)
library(gapminder)

Frist we’re going to build a scatterplot of life expectancy against GDP per capita using the gapminder dataset. If you want to explore the dataset before you use it you could type >help(gapminder) or >str(gapminder)

str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
  labs(x = "GDP per capita", y = "Life Expectancy (years)", 
       title = "Scatterplot of Life Expectancy against GDP per capita")

Now let’s separate out the data by Contintent.

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) + geom_point() +
  labs(x = "GDP per capita", y = "Life Expectancy (years)", 
       title = "Scatterplot of Life Expectancy against GDP per capita by Continent",
       colour = "Continent")

We can also use facet_wrap to split by Continent which might make things a little easier to see.

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) + geom_point() + facet_wrap(~continent) +
  labs(x = "GDP per capita", y = "Life Expectancy (years)", 
       title = "Scatterplot of Life Expectancy against GDP per capita by Continent",
       colour = "Continent") + guides(colour = FALSE) 

Now we’re going to focus on just one country, Sweden - famous for pop pioneers, Abba, and everyone’s favourite progressive death metal band, Opeth. No? Just mine then…

We will use the filter() function from the dplyr package to include only data from “Sweden” in our new variable I’m calling gapminder_sub1

gapminder_sub1 <- filter(gapminder, country == "Sweden")

Let’s plot a scatterplot of life expectancy against GDP and add a regression line.

ggplot(gapminder_sub1, aes(x = gdpPercap, y = lifeExp)) + geom_point() + geom_smooth(method = "lm") +
  labs(x = "GDP per capita", y = "Life Expectancy (years)", title = "Sweden Life Expectancy by GDP")

We can formally calculate the regression line using the lm() function.

model <- lm(lifeExp ~ gdpPercap, data = gapminder_sub1)
summary(model)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder_sub1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6476 -0.3361  0.0124  0.1841  1.1721 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.853e+01  4.410e-01  155.38  < 2e-16 ***
## gdpPercap   3.834e-04  2.074e-05   18.49 4.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5311 on 10 degrees of freedom
## Multiple R-squared:  0.9716, Adjusted R-squared:  0.9687 
## F-statistic: 341.9 on 1 and 10 DF,  p-value: 4.618e-09

We can look at some diagnostic plots to make sure everything looks ok.

plot(model)

Generally things look fine although point 12 is just outside .5 of Cook’s distance. Now we’re going to do the same for the UK.

gapminder_sub2 <- filter(gapminder, country == "United Kingdom")
ggplot(gapminder_sub2, aes(x = gdpPercap, y = lifeExp)) + geom_point() + geom_smooth(method = "lm") +
  labs(x = "GDP per capita", y = "Life Expectancy (years)", title = "UK Life Expectancy by GDP")

model <- lm(lifeExp ~ gdpPercap, data = gapminder_sub2)
summary(model)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder_sub2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75546 -0.29292 -0.03036  0.18865  0.99229 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.515e+01  4.231e-01  153.96  < 2e-16 ***
## gdpPercap   4.527e-04  2.051e-05   22.07 8.17e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5026 on 10 degrees of freedom
## Multiple R-squared:  0.9799, Adjusted R-squared:  0.9779 
## F-statistic: 487.2 on 1 and 10 DF,  p-value: 8.166e-10
plot(model)

In this case, point 12 is looking a little extreme.

Now, we’re going to combine our two dataframes using the rbind() function. This function will take two separate dataframes and bind them together by rows (i.e., one on top of each other) to produce a new dataframe. You can also combine two dataframes side by side by binding together by columns using the cbind() function. Here we’re sticking with rbind() and we’re mapping the output of this onto a new variable I’m calling data_combined.

data_combined <- rbind(gapminder_sub1, gapminder_sub2)

We’re now going to produce a different kind of plot. We’re going to build a violin plot showing the distribution of life expectancy data for our two countries. We’re also adding some descriptive statistics using the stat_summary() function - we’re adding the mean and SE bars.

ggplot(data_combined, aes(x = country, y = lifeExp, fill = country)) + geom_violin() + 
  geom_jitter(width = .1, alpha = .3) + stat_summary(fun.data = mean_se, geom = "errorbar", width=.25) + 
  stat_summary(fun.y = mean, geom = "point", size = 4) + guides(fill = FALSE)  +
  labs(x = "Country", y = "Life Expectancy (years)", title = "Life Expectancy for Sweden and the United Kingdom")

Now let’s look at the countries in Europe using our filter() function again and life expectancy against GDP.

europe <- filter(gapminder, continent == "Europe")
ggplot(europe, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
  labs(x = "GDP per capita", y = "Life Expectancy (years)", 
       title = "Life Expectancy against GDP per capita \nfor countries in Europe")

Now we’re going to plot boxplots of the Life Expectacy data for all the countries in Europe. Notice how I’m using the reorder() function to reorder the levels of our factor based on the median of each country’s Life Expectancy data.

ggplot(europe, aes(x = reorder(country, lifeExp, median), y = lifeExp, fill = country)) + geom_boxplot() + 
  coord_flip() + guides(fill = FALSE) + 
  labs(y = "Life Expectancy (years)", x = "Country", 
       title = "Boxplots of Life Expectancy in Countries in Europe over Time")

We’re now going to look at a dataset built into the tidyverse - it’s the diamonds dataset and contains the prices and other attributes of almost 54,000 diamonds. Let’s check the structure first using str()

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Let’s first plot Price against Clarity while also capturing Carat. Remember the alpha parameter in the geom_jitter() layer sets the level of translucency of each point - it can vary from 0 to 1. Change it from .1 to 1 to see what happens…

ggplot(diamonds, aes(x = clarity, y = price, colour = carat)) + geom_jitter(alpha = .1) + 
  labs(x = "Clarity (Worst to Best)", y = "Price in USD", colour = "Carat") 

Now let’s plot a histogram of diamond prices.

ggplot(diamonds, aes(x = price)) + geom_histogram(bins = 50) + 
  labs(x = "Price in USD", title = "Histogram of Diamond Prices")

We’re now going to look at histogrames of diamond prices separately for each type of diamond ‘cut’. Notice we’re also plotting in grey the histogram of the overall dataset. We’ve created a new dataset (called diamonds_all) using the select() function to allow us to plot this background via a separate call to geom_histogram().

diamonds_all <- select(diamonds, -cut)

ggplot(diamonds, aes(x = price)) + geom_histogram(data = diamonds_all, bins = 50, fill = "grey") + 
  geom_histogram(bins = 50) + facet_wrap(~cut) + 
  labs(x = "Price in USD", title = "Histograms of Diamond Prices Faceted by Diamond Cut")

Now let’s look at the whole data set again with Price plotted against type of diamond ‘cut’. We’re also adding some descriptive statistics using the stat_summary() function.

ggplot(diamonds, aes(x = cut, y = price, colour = carat)) + geom_jitter(alpha = .1) +
  labs(x = "Diamond Cut", y = "Price in USD", colour="Carat") +
  stat_summary(fun.data = mean_cl_boot, colour="red")

Let’s work out some descriptives ourselves using group_by() and summarise() from the dplyr package, alongside a bit of piping %>%. We’re working out (grouped by cut) the mean and standard deviations for price and carat.

diamonds %>% group_by(cut) %>% 
  summarise(mean_price = mean(price), sd_price = sd(price), count = n(), mean_carat = mean(carat),
            sd_carat = sd(carat))
## # A tibble: 5 x 6
##   cut       mean_price sd_price count mean_carat sd_carat
##   <ord>          <dbl>    <dbl> <int>      <dbl>    <dbl>
## 1 Fair           4359.    3560.  1610      1.05     0.516
## 2 Good           3929.    3682.  4906      0.849    0.454
## 3 Very Good      3982.    3936. 12082      0.806    0.459
## 4 Premium        4584.    4349. 13791      0.892    0.515
## 5 Ideal          3458.    3808. 21551      0.703    0.433

Let’s choose a mid-ranking clarity level and smaller diamonds (< .5 carats). We use the filter() function to do that. Can a diamond’s price be predicted by its carat level?

diamonds_sub <- filter(diamonds, clarity == "VS2" & carat < .5)

ggplot(diamonds_sub, aes(x = carat, y = price)) + geom_point() + geom_smooth(method = "lm") +
  labs(x = "Carats", y = "Price in USD", title = "Price in USD against Carats for \nMid-Range Clarity Small Diamonds")

The graph certainly suggests we can predict price if we know a diamond’s carat level. Is this supported by a linear model?

model <- lm(price ~ carat, data = diamonds_sub)
summary(model)
## 
## Call:
## lm(formula = price ~ carat, data = diamonds_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -377.84 -104.68   -3.57  121.74 1016.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -155.47      16.07  -9.675   <2e-16 ***
## carat        2668.40      46.51  57.372   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 143.2 on 4084 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.4461 
## F-statistic:  3292 on 1 and 4084 DF,  p-value: < 2.2e-16
plot(model)

Looks like we have a pretty nice fitting model with only one or two outliers.